Abschlussprojekt¶

Du arbeitest als Data Scientist bei einer P-2-P-Plattform, die vor einem Jahr gegründet wurde. Nun wollt ihr euer Geschäft erweitern. Euer Team hat sich aufgeteilt und jeder Analyst hat einen Teilbereich der Daten. Deine Aufgabe ist es in einer explorativen Datenanalyse Insights für eure Plattform herauszufinden.

Euer Geschäftsmodell ist das Betreiben einer Plattform (crowd-investing) bei der sich Personen die eine Geschäftsidee haben, aber nicht das benötigte Geld, anmelden und für ihr Projekt innerhalb einer vorgegebenen Zeit Geld sammeln können. Auf der anderen Seite habt ihr Geldgeber, die gern ihr Geld in Projekte anlegen möchten und nach Investitionen suchen. Als Vermittler bringt eure Plattform also Geldnehmer und Geldgeber zusammen. Ihr verdient euer Geld mit einer Provision für jedes Projekt was auf eurer Plattform landet. Das Rückzahlungsrisiko wird über den Zinssatz abgebildet.

Deine Datenbasis ist die Historie eurer Plattform. Alle Projekte sind abgeschlossene Projekte, d.h. die Zeit, um für sein Projekt Geld zu sammeln ist abgelaufen. Euer Geschäftsmodell sieht es vor, dass die gesammelten Gelder ausgezahlt werden, auch wenn der Zielbetrag nicht erreicht wurde.

Dafür ist der Datensatz zunächst vorzubereiten. In die Bewertung geht zudem die Dokumentation mit ein. Für die Visualisierung sind 3 verschiedenartige Plots zu benennen, die in die Bewertung eingehen sollen. Das bedeutet nicht, dass du nur 3 Plots ersellen darfst - ausschließlich für die Bewertung sind drei verschiedenartige Plots zu benennen! Weiterhin sollen es verschiedenartige Plots sein! Das bedeuted, ein Balkendiagramm und ein grouped Balkendiagramm sind am Ende des Tages Balkendiagramme. Im Abschlussprojekt zeigst du nicht, dass du eine Art von Plot in verschiedenen Varianten darstellen kannst, sondern, dass du verschiedene Plots kannst.

Vergiss dabei nicht die Erkenntnisse die aus jedem Plot gezogen werden können, zu notieren.

Die Dokumentation kann in englisch oder deutsch erfolgen.

Am Freitag erfolgt die Vorstellung der Plots innerhalb des Zeitrahmens von 5min - (+-1min) ist dabei ok.

Treff ist Donnerstag um 14:30 Uhr im Hauptraum. Da werden wir die Unterlagen in dokumentensichere Formate umwandeln. Das machen wir aber zusammen. Danach habt ihr bis 15:35 Uhr zeit, eure zu bewertenden Unterlagen in den Projektabgabeordner hochzuladen.

Der gesplittete Datensatz enthält folgende Spalten (inkl. Bedeutung):

- funded_amount ... mit Ablauf der "Crowding"zeit erhaltener Betrag/ ausgezahlter Betrag in USD
- loan_amount ... Zielbetrag (Betrag dem man für das Projekt erreichen wollte) in USD
- activity ... Unterkategory zu dem das Ziel des Crowdprojektes thematisch gehört
- sector ... Oberkategory in den das Crowdprojektes Thema fällt
- use ... Kurzbeschreibung wofür das Geld verwendet werden soll
- country_code ... Ländercode nach ISO Norm
- country ... Ländername nach ISO Norm
- region ... Region
- currency ... Währung in den der funded_amount dann ausgezahlt wurde
- term in months ... Dauer über die der Kredit ausgezahlt werden soll
- lender_count ...Darlehensgeber (also wieviele Personen Geld für das Projekt gegeben haben)
- borrower_genders ... Geschlecht und Anzahl der Darlehensnehmer, also diejenigen die das Crowdprojekt initiiert haben       
- repayment interval ... vertraglich vereinbarte Rückzahlungsmodalitäten/-rhythmus

to do:

  1. Data Preprocessing
                                    1.1) Datensatz einlesen
                                    1.2) fehlende Werte
                                    1.3) Ausreißer
                                    1.4) Pairplot - erste visuelle Inspektion
                                    1.5) Features
                                    1.6) Speicherplatz

1. Data Preprocessing¶

1.1. Datensatz einlesen¶

  • 2 csv Dateien --> mit welcher funktion lese ich die daten ein?
  • wie hängen die dateien miteinander zusammen? --> WIE muss ich die dateien zusammen fügen zu einem dataframe?
In [1]:
# benötigte bib importieren

import pandas as pd 
import numpy as np 
from collections import Counter

# benötigte bib für Visualisierung

import seaborn as sns 
import plotly.io as pio
import plotly.express as px

pio.renderers.default = "notebook"
In [2]:
# csv einlesen und vorher trennzeichen identifizieren

file_1 = "your_part1.csv"

df1 = pd.read_csv(file_1, 
                 sep="/n", 
                 engine='python',
                 nrows=2)
df1

# , ist trennzeichen 
Out[2]:
, funded_amount, loan_amount, activity, sector, use, country_code, country, region, currency, term_in_months, lender_count, borrower_genders, repayment_interval
0 0,300.0,300.0,Fruits & Vegetables,Food,"To buy...
1 1,575.0,575.0,Rickshaw,Transportation,to repai...
In [3]:
# df mit korrektem trennzeichen einlesen

df1 = pd.read_csv(file_1, 
                  sep=",", 
                  index_col=0)
df1.head(3)

# parameter index_col=0 führt dazu, dass die erste spalte (hier der index) als Index verwendet wird und somit nicht
# als "Unnamed 0" spalte auftaucht
Out[3]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
In [4]:
# csv einlesen und vorher trennzeichen identifizieren

file_2 = "your_part2.csv"

df2 = pd.read_csv(file_2, 
                 sep="/n", 
                 engine='python',
                 nrows=2)
df2

# # ist trennzeichen 
Out[4]:
# funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval
0 0#175.0#175.0#Liquor Store / Off-License#Food#...
1 1#325.0#325.0#Livestock#Agriculture#to buy 3 z...
In [5]:
df2 = pd.read_csv(file_2, 
                  sep="#", 
                  index_col=0)
df2.head(3)
Out[5]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 175.0 175.0 Liquor Store / Off-License Food to purchase additional stock of coconut wine t... PH Philippines Palo, Leyte PHP 8.0 6 female irregular
1 325.0 325.0 Livestock Agriculture to buy 3 zebus and food to fatten them up. MG Madagascar Antsirabe MGA 12.0 13 female monthly
2 550.0 550.0 Food Stall Food to buy ingredients for her food-vending busine... PH Philippines Cordova, Cebu PHP 5.0 6 female irregular
In [6]:
df1.columns
Out[6]:
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
       ' country_code', ' country', ' region', ' currency', ' term_in_months',
       ' lender_count', ' borrower_genders', ' repayment_interval'],
      dtype='object')
In [7]:
df2.columns
Out[7]:
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
       ' country_code', ' country', ' region', ' currency', ' term_in_months',
       ' lender_count', ' borrower_genders', ' repayment_interval'],
      dtype='object')

All the column names are same so we should join dataframes by index.

In [8]:
df = pd.concat([df1, df2], axis= 0).reset_index(drop=True)
df.head(3)
Out[8]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
In [9]:
df.shape
Out[9]:
(671205, 13)
In [10]:
df.columns
Out[10]:
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
       ' country_code', ' country', ' region', ' currency', ' term_in_months',
       ' lender_count', ' borrower_genders', ' repayment_interval'],
      dtype='object')

Spaces in column names shuold be deleted

In [11]:
df.columns = df.columns.str.strip()
In [12]:
df.columns
Out[12]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
       'country_code', 'country', 'region', 'currency', 'term_in_months',
       'lender_count', 'borrower_genders', 'repayment_interval'],
      dtype='object')

Check for duplicates

In [13]:
df.loc[df.duplicated()==True,:].head(3)
Out[13]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
498 100.0 100.0 Home Energy Personal Use to buy a solar-powered lamp. SV El Salvador NaN USD 14.0 4 male monthly
606 100.0 100.0 Home Energy Personal Use to buy a solar-powered lamp. SV El Salvador NaN USD 14.0 4 male monthly
808 450.0 450.0 Higher education costs Education to pay for one semester's registration fees. CO Colombia Bogotà COP 7.0 15 female monthly
In [ ]:
 
In [14]:
# df.drop(index=df.loc[df.duplicated() == True].index, inplace=True)
df.drop_duplicates(inplace=True, ignore_index=True)
In [15]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 646833 entries, 0 to 646832
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       646833 non-null  float64
 1   loan_amount         646833 non-null  float64
 2   activity            646833 non-null  object 
 3   sector              646833 non-null  object 
 4   use                 642934 non-null  object 
 5   country_code        646825 non-null  object 
 6   country             646833 non-null  object 
 7   region              590670 non-null  object 
 8   currency            646833 non-null  object 
 9   term_in_months      646833 non-null  float64
 10  lender_count        646833 non-null  int64  
 11  borrower_genders    642945 non-null  object 
 12  repayment_interval  646833 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 64.2+ MB

1.2. Missing Values¶

First finfings (Missing values):

  1. It seems that funded_amount, loan_amount has no problems.
  2. activity and sector, we should check for unique values
  3. We should check, use, region, country_code and borrower_genders for NaN values
In [16]:
# Check activity for the strange value
df.activity.unique()
Out[16]:
array(['Fruits & Vegetables', 'Rickshaw', 'Transportation', 'Embroidery',
       'Milk Sales', 'Services', 'Dairy', 'Beauty Salon', 'Manufacturing',
       'Food Production/Sales', 'Wholesale', 'General Store',
       'Clothing Sales', 'Poultry', 'Tailoring', 'Sewing', 'Bakery',
       'Restaurant', 'Food Stall', 'Farming', 'Construction Supplies',
       'Personal Products Sales', 'Home Products Sales',
       'Natural Medicines', 'Fish Selling', 'Education provider',
       'Shoe Sales', 'Machinery Rental', 'Butcher Shop', 'Pigs',
       'Personal Expenses', 'Food Market', 'Cosmetics Sales',
       'Personal Housing Expenses', 'Retail', 'Energy', 'Grocery Store',
       'Construction', 'Agriculture', 'Motorcycle Transport',
       'Charcoal Sales', 'Food', 'Pharmacy', 'Fishing', 'Timber Sales',
       'Cattle', 'Electronics Repair', 'Electronics Sales', 'Vehicle',
       'Cafe', 'Blacksmith', 'Higher education costs', 'Used Clothing',
       'Fuel/Firewood', 'Upholstery', 'Catering', 'Animal Sales',
       'Cereals', 'Vehicle Repairs', 'Arts',
       'Cloth & Dressmaking Supplies', 'Mobile Phones', 'Spare Parts',
       'Clothing', 'Metal Shop', 'Barber Shop', 'Furniture Making',
       'Crafts', 'Home Energy', 'Home Appliances', 'Wedding Expenses',
       'Taxi', 'Secretarial Services', 'Livestock', 'Property',
       'Recycling', 'Farm Supplies', 'Auto Repair', 'Beverages',
       'Plastics Sales', 'Electrical Goods', 'Carpentry', 'Photography',
       'Jewelry', 'Bricks', 'Pub', 'Phone Use Sales',
       'Water Distribution', 'Paper Sales', 'Computers',
       'Liquor Store / Off-License', 'Utilities', 'Knitting', 'Weaving',
       'Party Supplies', 'Medical Clinic', 'Internet Cafe',
       'Consumer Goods', 'Cement', 'Electrician',
       'Primary/secondary school costs', 'Veterinary Sales',
       'Land Rental', 'Laundry', 'Call Center', 'Perfumes', 'Hotel',
       'Motorcycle Repair', 'Movie Tapes & DVDs', 'Quarrying',
       'Personal Medical Expenses', 'Bookstore', 'Decorations Sales',
       'Recycled Materials', 'Office Supplies', 'Souvenir Sales',
       'Renewable Energy Products', 'Health', 'Printing', 'Phone Repair',
       'Traveling Sales', 'Flowers', 'Bicycle Repair', 'Entertainment',
       'Phone Accessories', 'Hardware', 'Used Shoes',
       'Music Discs & Tapes', 'Games', 'Balut-Making', 'Textiles',
       'Child Care', 'Goods Distribution', 'Florist', 'Cobbler', 'Dental',
       'Bookbinding', 'Cheese Making', 'Bicycle Sales', 'Well digging',
       'Technology', 'Musical Performance', 'Waste Management', 'Film',
       'Tourism', 'Musical Instruments', 'Religious Articles',
       'Machine Shop', 'Cleaning Services', 'Sporting Good Sales',
       'Patchwork', 'Funerals', 'Air Conditioning', 'Communications',
       'Adult Care', 'Landscaping / Gardening', 'Aquaculture',
       'Beekeeping', 'Event Planning', 'Celebrations', 'Computer',
       'Personal Care Products', 'Mobile Transactions'], dtype=object)
In [17]:
# Check sector for the strange value
df.sector.unique()
Out[17]:
array(['Food', 'Transportation', 'Arts', 'Services', 'Agriculture',
       'Manufacturing', 'Wholesale', 'Retail', 'Clothing', 'Construction',
       'Health', 'Education', 'Personal Use', 'Housing', 'Entertainment'],
      dtype=object)
In [18]:
df.currency.unique()
Out[18]:
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
       'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
       'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
       'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
       'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
       'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
       'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
       'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
In [19]:
df.borrower_genders.unique()
Out[19]:
array(['female', 'female, female', 'female, female, female', ...,
       'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
       'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
       'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
      dtype=object)
In [20]:
df.repayment_interval.unique()
Out[20]:
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
In [21]:
# Check the nulls of columns
df.isnull().sum()
Out[21]:
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                    3899
country_code              8
country                   0
region                56163
currency                  0
term_in_months            0
lender_count              0
borrower_genders       3888
repayment_interval        0
dtype: int64
In [22]:
# Check all nulls happens at the same time
df.loc[(df.borrower_genders.isnull()) & (df.region.isnull()) & (df.region.isnull()),:].count()
Out[22]:
funded_amount         3888
loan_amount           3888
activity              3888
sector                3888
use                      0
country_code          3888
country               3888
region                   0
currency              3888
term_in_months        3888
lender_count          3888
borrower_genders         0
repayment_interval    3888
dtype: int64

NaN values: So what we see here is Region column has too many NaNs and not good, so we should check if we can delete this column. The null values in other columns can be deleted as rows The null values in Country code should be checked.

Schritt 1: fehlende Werte identifizieren¶

In [23]:
df.loc[df.region.isna() == True,:]
Out[23]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
5 250.0 250.0 Services Services purchase leather for my business using ksh 20000. KE Kenya NaN KES 4.0 6 female irregular
49 450.0 450.0 General Store Retail to stock his store. SV El Salvador NaN USD 14.0 18 male monthly
54 225.0 225.0 Food Market Food to purchase various seasonal items to resell: ... SN Senegal NaN XOF 14.0 7 female monthly
67 125.0 125.0 Energy Services purchase solar lanterns for resale. KE Kenya NaN KES 3.0 6 male irregular
70 2000.0 2000.0 Retail Retail to install a display window and a sunshade for... IQ Iraq NaN USD 15.0 71 male monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
646822 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
646823 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya NaN KES 13.0 0 female, female monthly
646824 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya NaN KES 13.0 0 female monthly
646825 0.0 25.0 Games Entertainment Kiva Coordinator replaced loan use. Should see... KE Kenya NaN KES 13.0 0 female, female monthly
646826 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly

56163 rows × 13 columns

In [24]:
df.loc[df.borrower_genders.isna() == True,:]
Out[24]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
140 2975.0 2975.0 Food Production/Sales Food NaN TZ Tanzania NaN TZS 10.0 110 NaN monthly
145 1200.0 1200.0 Personal Expenses Personal Use NaN PE Peru NaN PEN 20.0 44 NaN monthly
170 4250.0 4250.0 Catering Food NaN TZ Tanzania NaN TZS 10.0 116 NaN monthly
412 2350.0 2350.0 Beauty Salon Services NaN TZ Tanzania NaN TZS 10.0 75 NaN monthly
414 725.0 725.0 Agriculture Agriculture NaN SV El Salvador NaN USD 20.0 19 NaN monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
636081 5625.0 10000.0 Weaving Arts NaN BT Bhutan NaN USD 14.0 210 NaN irregular
637157 1975.0 1975.0 Home Energy Personal Use NaN PS Palestine NaN USD 27.0 39 NaN monthly
637995 800.0 1600.0 Furniture Making Manufacturing NaN HT Haiti NaN HTG 13.0 27 NaN irregular
646812 0.0 25.0 Livestock Agriculture NaN KE Kenya NaN KES 13.0 0 NaN monthly
646826 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly

3888 rows × 13 columns

In [25]:
# Check the numerical columns for strange words
cols = ["funded_amount", "loan_amount", "term_in_months", "lender_count"]
for x in cols:
    signs = []
    for element in df.loc[:, x]:
        
        try:
            float(element)   
        except:
            signs.append(element)
    print(f'{x}: {signs}')
 
funded_amount: []
loan_amount: []
term_in_months: []
lender_count: []

Lets check region if we should delete or not.¶

In [26]:
df.loc[df.region.isnull(),:].head(5)
Out[26]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
5 250.0 250.0 Services Services purchase leather for my business using ksh 20000. KE Kenya NaN KES 4.0 6 female irregular
49 450.0 450.0 General Store Retail to stock his store. SV El Salvador NaN USD 14.0 18 male monthly
54 225.0 225.0 Food Market Food to purchase various seasonal items to resell: ... SN Senegal NaN XOF 14.0 7 female monthly
67 125.0 125.0 Energy Services purchase solar lanterns for resale. KE Kenya NaN KES 3.0 6 male irregular
70 2000.0 2000.0 Retail Retail to install a display window and a sunshade for... IQ Iraq NaN USD 15.0 71 male monthly
In [27]:
df.groupby(["country", "region"]).agg(nof_cr=("region", "size"))
Out[27]:
nof_cr
country region
Afghanistan Kandahar City 2
Albania Cerrik 1
Elbasan 243
Korce 753
Lac 122
... ... ...
Zimbabwe Shurugwi 120
Umguza 54
Umzingwane 43
mwenezi 106
wedza 109

12749 rows × 1 columns

In [28]:
# To make sure region data compare null regions to not null regions
df_null_region = df.loc[df.region.isnull(),["country"]]
df_null_region = df_null_region.groupby(["country"]).agg(null_regions = ("country", "size"))

df_notnull_region = df.loc[df.region.isna() == False,["country"]]
df_notnull_region = df_notnull_region.groupby(["country"]).agg(notnull_regions = ("country", "size"))

df_null_region_compare = df_notnull_region.merge(df_null_region, how="left", on="country")
df_null_region_compare
Out[28]:
notnull_regions null_regions
country
Afghanistan 2 NaN
Albania 1619 315.0
Armenia 8624 5.0
Azerbaijan 1772 170.0
Belize 122 2.0
... ... ...
Vanuatu 4 NaN
Vietnam 9531 3.0
Yemen 2308 5.0
Zambia 747 28.0
Zimbabwe 3921 52.0

82 rows × 2 columns

In [29]:
# As a sample lets check USA as country
df_null_region_compare.index
Out[29]:
Index(['Afghanistan', 'Albania', 'Armenia', 'Azerbaijan', 'Belize', 'Benin',
       'Bhutan', 'Bolivia', 'Brazil', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Cameroon', 'Chile', 'China', 'Colombia', 'Congo', 'Costa Rica',
       'Cote D'Ivoire', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Georgia', 'Ghana', 'Guatemala', 'Haiti', 'Honduras',
       'India', 'Indonesia', 'Israel', 'Jordan', 'Kenya', 'Kyrgyzstan',
       'Lao People's Democratic Republic', 'Lebanon', 'Lesotho', 'Liberia',
       'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mexico', 'Moldova',
       'Mongolia', 'Mozambique', 'Myanmar (Burma)', 'Namibia', 'Nepal',
       'Nicaragua', 'Nigeria', 'Pakistan', 'Palestine', 'Panama', 'Paraguay',
       'Peru', 'Philippines', 'Rwanda', 'Saint Vincent and the Grenadines',
       'Samoa', 'Senegal', 'Sierra Leone', 'Solomon Islands', 'Somalia',
       'South Africa', 'South Sudan', 'Suriname', 'Tajikistan', 'Tanzania',
       'Thailand', 'The Democratic Republic of the Congo', 'Timor-Leste',
       'Togo', 'Turkey', 'Uganda', 'Ukraine', 'United States', 'Vanuatu',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country')
In [30]:
df_null_region_compare.loc[df_null_region_compare.index == "United States",:]
Out[30]:
notnull_regions null_regions
country
United States 920 5172.0

So the Region data is not reliable and used. It can be deleted:¶

As a sample for United States there are 5172 NaN rows versus 920 normal values. This shows that this column has no reliable data and can be deleted.

In [31]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 646833 entries, 0 to 646832
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       646833 non-null  float64
 1   loan_amount         646833 non-null  float64
 2   activity            646833 non-null  object 
 3   sector              646833 non-null  object 
 4   use                 642934 non-null  object 
 5   country_code        646825 non-null  object 
 6   country             646833 non-null  object 
 7   region              590670 non-null  object 
 8   currency            646833 non-null  object 
 9   term_in_months      646833 non-null  float64
 10  lender_count        646833 non-null  int64  
 11  borrower_genders    642945 non-null  object 
 12  repayment_interval  646833 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 64.2+ MB

Schritt 2: Was machen wir mit den fehlenden Werten?¶

In [32]:
# Delete Region column
df.drop(columns="region", axis=1, inplace=True)
In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 646833 entries, 0 to 646832
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       646833 non-null  float64
 1   loan_amount         646833 non-null  float64
 2   activity            646833 non-null  object 
 3   sector              646833 non-null  object 
 4   use                 642934 non-null  object 
 5   country_code        646825 non-null  object 
 6   country             646833 non-null  object 
 7   currency            646833 non-null  object 
 8   term_in_months      646833 non-null  float64
 9   lender_count        646833 non-null  int64  
 10  borrower_genders    642945 non-null  object 
 11  repayment_interval  646833 non-null  object 
dtypes: float64(3), int64(1), object(8)
memory usage: 59.2+ MB

Working with use, borrower_genders NaN values¶

Here the values can not be completed with something like median or andere filling values. That doesn't make sense. In addition these Nan rows exist in both columns. The best option should be deleting the rows. 3900 rows against 640000 rows is not a big issue. But it can make other problems, some countries that falls completely in these rows might lose its representation.

In [34]:
# Delete Nan Rows in Use and Borrower_Gender columns
df.dropna(subset=["use", "borrower_genders"], inplace=True)
In [35]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 642934 entries, 0 to 646832
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       642934 non-null  float64
 1   loan_amount         642934 non-null  float64
 2   activity            642934 non-null  object 
 3   sector              642934 non-null  object 
 4   use                 642934 non-null  object 
 5   country_code        642926 non-null  object 
 6   country             642934 non-null  object 
 7   currency            642934 non-null  object 
 8   term_in_months      642934 non-null  float64
 9   lender_count        642934 non-null  int64  
 10  borrower_genders    642934 non-null  object 
 11  repayment_interval  642934 non-null  object 
dtypes: float64(3), int64(1), object(8)
memory usage: 63.8+ MB
In [36]:
df.isnull().sum()
Out[36]:
funded_amount         0
loan_amount           0
activity              0
sector                0
use                   0
country_code          8
country               0
currency              0
term_in_months        0
lender_count          0
borrower_genders      0
repayment_interval    0
dtype: int64
In [37]:
# Lets check country code
df.loc[df.country_code.isnull(),:]
Out[37]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count borrower_genders repayment_interval
198369 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia NAD 6.0 162 female bullet
198646 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia NAD 6.0 159 male bullet
337622 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia NAD 7.0 120 female bullet
343771 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia NAD 7.0 126 male bullet
411637 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia NAD 7.0 118 female bullet
411895 4000.0 4000.0 Wholesale Wholesale purchase solar lighting products for sale to l... NaN Namibia NAD 7.0 150 male bullet
474672 5100.0 5100.0 Renewable Energy Products Retail to pay for stock of solar lights and cell phon... NaN Namibia NAD 7.0 183 male bullet
475099 5000.0 5000.0 Wholesale Wholesale to maintain a stock of solar lights and cell p... NaN Namibia NAD 7.0 183 female bullet

NaN values for country codes. All the NAN values are for Namibia.¶

That can be the reason system understood the Namibia s country code "NA" as null.

In [38]:
# We make changes
df["country_code"] = df["country_code"].fillna("NA")
df.loc[df.country_code == "NA",:]
Out[38]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count borrower_genders repayment_interval
198369 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia NAD 6.0 162 female bullet
198646 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia NAD 6.0 159 male bullet
337622 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia NAD 7.0 120 female bullet
343771 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia NAD 7.0 126 male bullet
411637 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia NAD 7.0 118 female bullet
411895 4000.0 4000.0 Wholesale Wholesale purchase solar lighting products for sale to l... NA Namibia NAD 7.0 150 male bullet
474672 5100.0 5100.0 Renewable Energy Products Retail to pay for stock of solar lights and cell phon... NA Namibia NAD 7.0 183 male bullet
475099 5000.0 5000.0 Wholesale Wholesale to maintain a stock of solar lights and cell p... NA Namibia NAD 7.0 183 female bullet
In [39]:
df.isnull().sum()
Out[39]:
funded_amount         0
loan_amount           0
activity              0
sector                0
use                   0
country_code          0
country               0
currency              0
term_in_months        0
lender_count          0
borrower_genders      0
repayment_interval    0
dtype: int64

Changing the datatype that we see at first.¶

In [40]:
# Check if "term_in_months" should be converted to integer
cols = ["term_in_months"]
for x in cols:
    signs1 = set()
    for element in df.loc[:,x]:
        if element % 1 > 0:
            signs1.add(element)
    print(f'{x}: {signs1}')
term_in_months: set()
In [41]:
df.term_in_months = pd.to_numeric(df.term_in_months, downcast="integer")
In [42]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 642934 entries, 0 to 646832
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       642934 non-null  float64
 1   loan_amount         642934 non-null  float64
 2   activity            642934 non-null  object 
 3   sector              642934 non-null  object 
 4   use                 642934 non-null  object 
 5   country_code        642934 non-null  object 
 6   country             642934 non-null  object 
 7   currency            642934 non-null  object 
 8   term_in_months      642934 non-null  int16  
 9   lender_count        642934 non-null  int64  
 10  borrower_genders    642934 non-null  object 
 11  repayment_interval  642934 non-null  object 
dtypes: float64(2), int16(1), int64(1), object(8)
memory usage: 60.1+ MB

1.3. Ausreißer¶

Check if we should delete ausreiser¶

In [43]:
fig = px.scatter(data_frame=df, x="funded_amount", y="loan_amount")
fig.show()

1.4. Pairplot - First visual inspection¶

In [44]:
sns.pairplot(data=df,corner=True) 
Out[44]:
<seaborn.axisgrid.PairGrid at 0x22a010bba00>

Findings (Pairplot):

  1. Most of the projects had reached the target funding
  2. It seems that term_in_months has no connection with other data, I decided not to focus on that data.
  3. There is one ausreiser with 100000 loan amount to be checked.
In [45]:
df.loc[df.loan_amount == 100000,:]
Out[45]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count borrower_genders repayment_interval
69745 100000.0 100000.0 Agriculture Agriculture create more than 300 jobs for women and farmer... HT Haiti USD 75 2986 female irregular

Decided not to delete because the project is big and lender_count is also big,¶

so the data is reasonable and won't make harm to our report.

Important finding about initiaters¶

We saw that most initiaters are women. And that can say something important to us. That's way we have decided to show the borrower_genders column in a better way. Existing structure of the column is not usable. For this we decided to make two new column from this column showing the number of male, female initiaters.

In [46]:
# New columns for borrower genders
df["borrower_male_count"] = df.borrower_genders.map(lambda x: [i.strip() for i in x.split(",")]) \
    .map(lambda x: Counter(x)).map(lambda x: x["male"])

df["borrower_female_count"] = df.borrower_genders.map(lambda x: [i.strip() for i in x.split(",")]) \
    .map(lambda x: Counter(x)).map(lambda x: x["female"])
In [47]:
df.loc[45:55,:]
Out[47]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count borrower_genders repayment_interval borrower_male_count borrower_female_count
45 150.0 150.0 Farming Agriculture to purchase potato seeds and fertilizers for c... IN India INR 43 6 female bullet 0 1
46 250.0 250.0 Fruits & Vegetables Food to purchase sacks of tomatoes, potatoes, fruit... KE Kenya KES 13 7 female irregular 0 1
47 250.0 250.0 Farming Agriculture to purchase potato seeds and fertilizers for g... IN India INR 43 10 female bullet 0 1
48 600.0 600.0 Machinery Rental Services to invest in working capital and to maintain g... NI Nicaragua NIO 14 16 female monthly 0 1
49 450.0 450.0 General Store Retail to stock his store. SV El Salvador USD 14 18 male monthly 1 0
50 3175.0 3175.0 Butcher Shop Food to buy meat and also to start selling fish in ... TZ Tanzania TZS 10 93 male, male, male, male, male monthly 5 0
51 175.0 175.0 Pigs Agriculture to buy piglets and feed PH Philippines PHP 8 6 female irregular 0 1
52 175.0 175.0 Pigs Agriculture to purchase feed and vitamins for her pigs PH Philippines PHP 8 7 female irregular 0 1
53 550.0 550.0 Personal Expenses Personal Use to buy a cradle and household items for his yo... PE Peru PEN 14 20 male monthly 1 0
54 225.0 225.0 Food Market Food to purchase various seasonal items to resell: ... SN Senegal XOF 14 7 female monthly 0 1
55 700.0 700.0 General Store Retail to buy additional items like eggs, charcoal, r... PH Philippines PHP 5 2 female irregular 0 1

Now we don't need borrower_genders column anymore. So we can delete it.¶

In [48]:
df.drop(columns=["borrower_genders"], axis=1, inplace=True)
df.head(5)
Out[48]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count repayment_interval borrower_male_count borrower_female_count
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan PKR 12 12 irregular 0 1
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan PKR 11 14 irregular 0 2
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India INR 43 6 bullet 0 1
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan PKR 11 8 irregular 0 1
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan PKR 14 16 monthly 0 1

Correlation Matrix¶

In [49]:
#Correlation Matrix
import matplotlib.pyplot as plt
corr = df.corr()
plt.figure(figsize=(12,12))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, annot=True, 
    cmap = 'coolwarm', square=True)
plt.title('Correlation between different features')
corr
Out[49]:
funded_amount loan_amount term_in_months lender_count borrower_male_count borrower_female_count
funded_amount 1.000000 0.951415 0.153453 0.846736 0.197196 0.471324
loan_amount 0.951415 1.000000 0.190599 0.801527 0.209109 0.446061
term_in_months 0.153453 0.190599 1.000000 0.233387 -0.028207 -0.154234
lender_count 0.846736 0.801527 0.233387 1.000000 0.170906 0.264898
borrower_male_count 0.197196 0.209109 -0.028207 0.170906 1.000000 0.173048
borrower_female_count 0.471324 0.446061 -0.154234 0.264898 0.173048 1.000000

Findings (Correlation Matrix):

  1. loan_amount had dependacy with funded_amount, that is clear.
  2. lender_count had dependancy with loan_amount and funded_amount
  3. It seems that ther values had no big dependancy

1.5. Features¶

New Feature: Funding amount pro lender¶

After checking the Correlation Matrix, I saw that Lender count has dependancy between funded amount. So funding amount for each lender can say something to us at least: How much do people generally make funding in an event?

In [50]:
df["funding_pro_lender"] = df.funded_amount / df.lender_count
df.head(5)
Out[50]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count repayment_interval borrower_male_count borrower_female_count funding_pro_lender
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan PKR 12 12 irregular 0 1 25.000000
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan PKR 11 14 irregular 0 2 41.071429
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India INR 43 6 bullet 0 1 25.000000
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan PKR 11 8 irregular 0 1 25.000000
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan PKR 14 16 monthly 0 1 25.000000
In [51]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 642934 entries, 0 to 646832
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   funded_amount          642934 non-null  float64
 1   loan_amount            642934 non-null  float64
 2   activity               642934 non-null  object 
 3   sector                 642934 non-null  object 
 4   use                    642934 non-null  object 
 5   country_code           642934 non-null  object 
 6   country                642934 non-null  object 
 7   currency               642934 non-null  object 
 8   term_in_months         642934 non-null  int16  
 9   lender_count           642934 non-null  int64  
 10  repayment_interval     642934 non-null  object 
 11  borrower_male_count    642934 non-null  int64  
 12  borrower_female_count  642934 non-null  int64  
 13  funding_pro_lender     639771 non-null  float64
dtypes: float64(3), int16(1), int64(3), object(7)
memory usage: 86.0+ MB

Because of the unfunded values new NaNs have been created. We can fill them with median()¶

I think making median by countries seems more sensible. I will try to make so. (Tested but did not work)

### A test to country based median filling¶

df_neu = df.copy() for i in df_neu.country.unique(): if i == "Virgin Islands": median_v = df_neu.loc[:,'funding_pro_lender'].median() df_neu.loc[df_neu.country == i,:].replace(np.nan, median_v, inplace=True) else: median_v = df.loc[df['country']==i,'funding_pro_lender'].median() df.loc[df.country == i,:].replace(np.nan, median_v, inplace=True)

In [52]:
# Fill NaN values with median
median_v = df.loc[:,'funding_pro_lender'].median()
df.replace(np.nan, median_v, inplace=True)
df.head(5)
Out[52]:
funded_amount loan_amount activity sector use country_code country currency term_in_months lender_count repayment_interval borrower_male_count borrower_female_count funding_pro_lender
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan PKR 12 12 irregular 0 1 25.000000
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan PKR 11 14 irregular 0 2 41.071429
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India INR 43 6 bullet 0 1 25.000000
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan PKR 11 8 irregular 0 1 25.000000
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan PKR 14 16 monthly 0 1 25.000000
In [53]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 642934 entries, 0 to 646832
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   funded_amount          642934 non-null  float64
 1   loan_amount            642934 non-null  float64
 2   activity               642934 non-null  object 
 3   sector                 642934 non-null  object 
 4   use                    642934 non-null  object 
 5   country_code           642934 non-null  object 
 6   country                642934 non-null  object 
 7   currency               642934 non-null  object 
 8   term_in_months         642934 non-null  int16  
 9   lender_count           642934 non-null  int64  
 10  repayment_interval     642934 non-null  object 
 11  borrower_male_count    642934 non-null  int64  
 12  borrower_female_count  642934 non-null  int64  
 13  funding_pro_lender     642934 non-null  float64
dtypes: float64(3), int16(1), int64(3), object(7)
memory usage: 86.0+ MB

1.6. Storage space optimisation¶

For storage space optimization

  • Check the data types, downcast if possible
  • Check objects, categorize if possible
In [54]:
df.dtypes
Out[54]:
funded_amount            float64
loan_amount              float64
activity                  object
sector                    object
use                       object
country_code              object
country                   object
currency                  object
term_in_months             int16
lender_count               int64
repayment_interval        object
borrower_male_count        int64
borrower_female_count      int64
funding_pro_lender       float64
dtype: object
In [55]:
df.memory_usage(deep=True) 
Out[55]:
Index                    22051800
funded_amount             5143472
loan_amount               5143472
activity                 44507999
sector                   41716861
use                      75870145
country_code             37933106
country                  42081570
currency                 38576040
term_in_months            1285868
lender_count              5143472
repayment_interval       41581393
borrower_male_count       5143472
borrower_female_count     5143472
funding_pro_lender        5143472
dtype: int64
In [56]:
df.memory_usage()
Out[56]:
Index                    22051800
funded_amount             5143472
loan_amount               5143472
activity                  5143472
sector                    5143472
use                       5143472
country_code              5143472
country                   5143472
currency                  5143472
term_in_months            1285868
lender_count              5143472
repayment_interval        5143472
borrower_male_count       5143472
borrower_female_count     5143472
funding_pro_lender        5143472
dtype: int64
In [57]:
df_opti = df.copy()
In [58]:
# Downcast floats
df_opti[["funded_amount","loan_amount", "funding_pro_lender"]] = df_opti[["funded_amount", "loan_amount", "funding_pro_lender"]].apply(pd.to_numeric, downcast="float")
df_opti.dtypes  
Out[58]:
funded_amount            float32
loan_amount              float32
activity                  object
sector                    object
use                       object
country_code              object
country                   object
currency                  object
term_in_months             int16
lender_count               int64
repayment_interval        object
borrower_male_count        int64
borrower_female_count      int64
funding_pro_lender       float32
dtype: object
In [59]:
# Downcast integers
df_opti[["term_in_months", "lender_count","borrower_male_count", "borrower_female_count"]] = df_opti[["term_in_months", "lender_count","borrower_male_count", "borrower_female_count"]].apply(pd.to_numeric, downcast="integer")
df_opti.dtypes  
Out[59]:
funded_amount            float32
loan_amount              float32
activity                  object
sector                    object
use                       object
country_code              object
country                   object
currency                  object
term_in_months             int16
lender_count               int16
repayment_interval        object
borrower_male_count         int8
borrower_female_count       int8
funding_pro_lender       float32
dtype: object
In [60]:
# Check if we can convert objects to category

liste = df_opti.select_dtypes("object").columns 

for spalte in liste:
    print(spalte, df_opti.loc[:,spalte].nunique())
activity 163
sector 15
use 424912
country_code 87
country 87
currency 67
repayment_interval 4

Except use we can convert all other objects¶

In [61]:
# Umwandlung in Category Datentyp - mit astype() 

df_opti = df_opti.copy()

liste_category = ['activity','sector','country_code', 'country', 'currency', "repayment_interval"]

for spalte in liste_category:
    df_opti.loc[:,spalte] = df_opti.loc[:,spalte].astype('category')

    
# check, ob Umwandlung erfolgreich

df_opti.dtypes 
Out[61]:
funded_amount             float32
loan_amount               float32
activity                 category
sector                   category
use                        object
country_code             category
country                  category
currency                 category
term_in_months              int16
lender_count                int16
repayment_interval       category
borrower_male_count          int8
borrower_female_count        int8
funding_pro_lender        float32
dtype: object
In [62]:
df.memory_usage(deep=True)
Out[62]:
Index                    22051800
funded_amount             5143472
loan_amount               5143472
activity                 44507999
sector                   41716861
use                      76437392
country_code             37933106
country                  42081570
currency                 38576040
term_in_months            1285868
lender_count              5143472
repayment_interval       41581393
borrower_male_count       5143472
borrower_female_count     5143472
funding_pro_lender        5143472
dtype: int64
In [63]:
df_opti.memory_usage(deep=True)
Out[63]:
Index                    22051800
funded_amount             2571736
loan_amount               2571736
activity                  1301329
sector                     644481
use                      76437392
country_code               650171
country                    650758
currency                   649058
term_in_months            1285868
lender_count              1285868
repayment_interval         643362
borrower_male_count        642934
borrower_female_count      642934
funding_pro_lender        2571736
dtype: int64
In [64]:
df_opti.dtypes
Out[64]:
funded_amount             float32
loan_amount               float32
activity                 category
sector                   category
use                        object
country_code             category
country                  category
currency                 category
term_in_months              int16
lender_count                int16
repayment_interval       category
borrower_male_count          int8
borrower_female_count        int8
funding_pro_lender        float32
dtype: object

Export prepared data for visualisation¶

In [65]:
df_opti.to_pickle('df_after_preprocessing.pkl')